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ABSTRACT 


The author developed an automatic text categorization approach and investigated 
its application upon categorizing emails. The categorization approach is derived from an 
instanced-based learning method that explores conditional probabilities of particular 
words. The effectiveness of the author’s categorization approach using collections from 
a set of emails is then evaluated and assigned a numerical score based upon precision and 
recall. Precision was 65% while recall was 17%. The author’s experiments indicated 
automatic categorization of incoming emails at the client level can categorize email, but 
is difficult when not using a standardized corpus. Word frequency is valuable, but should 
be used in combination with other methods such as phrase extraction for a higher level of 
performance. 
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1. INTRODUCTION 


A. BACKGROUND 

Email has become the standard for fast, inexpensive and easily accessible 
communication. The explosive growth of email is affecting everyone in the Department 
of Defense as well as the civilian work environment. Its largest impact is on management 
and record-keeping personnel. Typical military users receive between 30 - 70 emails a 
day depending on their rank and billet (Marsan, 2002). If one attaches spreadsheets, 
documents, presentations, graphics and executable programs then email gains even more 
value. Because of its ease of use, email has become an integral part of military- 
organization daily operations. All of the information in unclassified and classified emails 
is also a treasure trove of operational data. 

B. THESIS OBJECTIVE 

This thesis seeks to improve the organization of individual user’s email by 
implementing an automated categorizer for email. The author seeks to try to eliminate 
the large amounts of manual email categorization that is currently done by many users. 
This could be useful to military personnel due to efficiency, privacy, and high-turnover 
concerns. Turnover of military personnel happens every 2-3 years and frequently old 
email is simply deleted rather than organized and used to document valuable operational 
processes and data. Generally speaking within the DoD there are no formal filing or 
retention policies for email. There are guidelines such as the DoD Directive 5012.2, but 
these deal with large-scale records management. 

Old email messages are required for numerous purposes. These include day-to- 
day business operations, and requests for, historical financial information, activities, 
logistics, etc. When each individual decides which email messages to retain or delete, 
much information is hidden from the rest of the organization. Email storage of this kind 
is scattered in personal archives. Although servers can store these messages, the number 
of messages is constantly increasing. If backup tapes are not kept for retrieving old 
messages, and an individual goes on leave or is unavailable, critical information in his 
email is unavailable. This suggests organizational-level archiving and categorizing, but 
what can be done at the user level to help organize this information? 
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With the average user manually archiving an estimated 300 megabytes (MB) of 
email a year, categorizing methods can vary dramatically in their effectiveness and 
overall organization (Ferris, 1999). Our primary goal of this research was to survey 
existing methods and determine a method or a combination of methods that would work 
well for email categorization. A secondary goal of this research was to create a tool to 
accurately and quickly categorize and archive email messages at the local user level. 

This thesis will attempt to answer the following questions: 

• Can machine- learning programs accurately categorize e- mail? 

• What are the strengths and weaknesses of automated categorizers? 

C. OUTLINE OF THESIS 

Chapter II describes previous attempts at automated categorizing and other similar 
problems in text retrieval. Chapter III gives a description of the structure and 
components of the program. A detailed description of the data and corpus is also given. 
Chapter IV provides a description of a categorization program that we developed. 
Chapter V discusses the program’s performance and the accuracy of its results. Chapter 
VI reviews the program’s achievements and major weaknesses. 
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11. OTHER WORK ON TEXT CATEGORIZATION 


A. INTRODUCTION 

Text categorization has become a very active research topic over the last few 
years. Many of the approaches seek to categorize documents of the Internet. In this 
thesis email is the document and specifically the text within the email. Categories can be 
summarized using phrases, words, or numerically. Traditionally, a domain expert, 
usually a librarian, does text categorization manually. Documents are read by the expert 
and then placed in the appropriate category. To eliminate the large amount of manual 
effort required, we could use automatic categorization that learns automatically from 
using training examples. The classic approach is to assign weights to particular words in 
particular categories; the inferred category of a document is the one with the highest 
weighted sum (Witten, Frank, 2000). 

Two categorization techniques used are instance-based learning and Naive-Bayes 
probabilistic classification. Instance-based learning methods begin with a particular 
example and generalize it to cover other similar examples in the same category. The 
Naive-Bayes approach uses the conditional probabilities of categories given a word to 
estimate the probabilities of categories given an email document; this model assumes 
word independence. Typically a list of “stop words” to be ignored and some sort of 
destemming algorithm are used to help normali z e the word list. 

B. LITERATURE REVIEW 

Lam, Ruiz, and Srinivasan investigated whether automatic categorization will 
have better retrieval performance than that achieved using manual categorization applied 
to medical documents (Lam, Ruiz, Srinivasan 1999). They analyzed the retrieval 
performance on test queries to gain insights on the interaction of their categorizer and text 
retrieval. The first part of their work dealt with automatic categorization including a 
category-extraction process. For their test documents they use a corpus of medical 
documents from the MEDLINE database that is referred to as the HERSH corpus. 

The authors ran a series of experiments on parameter selection to provide a metric 
and categorization results. Their results are broken down into category and document 
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perspectives. The category perspective results are related to sizes of categories ranging 
from 10 to 60 categories. Three different parameters were tested: CO, C35 and C50. CO 
used all manually assigned categories that existed in the training set and test set. C35 and 
C50 limit the number of categories to those that have a document frequency greater than 
35 or 50 per category. The document frequency is the number of documents that a 
specific category is assigned to. The FI score is a weighted combination of recall and 
precision, with the scores being averaged to determine a mean. Their results for 
parameter selection can be seen in Table 1. 


Run 

Parameter selection based 
on training set 

N 

M 

#of 

categories 

FI score 

CO 

641 

0.258 

5 

50 

C35 

58 

0.468 

5 

40 

C50 

43 

0.509 

30 

20 


Table 1. Results for Parameter Selection in (Lam, Ruiz, Srinivasan 1999). 

The results indicate that as the frequency threshold on the category set increases, 
the mean FI score improves. N represents the number of documents while M was the 
number of categories. 

Yang did a comparative evaluation of statistical approaches to text categorization 
(Yang, 1998). The author uses several versions of the Reuters newswire corpus of 
20,000 documents to evaluate the categorization methods of k-nearest neighbors, simple 
word matching, decision trees, Naive-Bayes, inductive-rule learning in disjunctive 
normal form, neural networks, Rocchio, linear least-squares fit, and “sleeping experts”. 
The authors found that linear least squares fit performed best. 

In addition to experimenting with thresholding techniques, Yang concluded that 
variability on the performance of classifiers with collection is common. Although the 
Word approach, which looks at single word frequency, had increased performance when 
changing from a labeled to an unlabeled corpus, it was still out performed by other 
methods such as kNN and LLSF. 
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Moens and Dumortier applied text categorization to magazine articles to study the 
effects of selection of feature words and proper names (Marie-Francine, Dumortier, 
2000). The authors use a standardized approach of stop-word removal and then select 
keywords by applying statistical weights to the remaining words after stop words are 
taken out. For proper names, words with capitalization are given a heavier weighting. 
Terms with a calculated weight above 0.4 conditional probability are selected. The 
authors also apply the technique of “zoning”, which is the selection of word examples 
that are in close proximity to other word examples within the document. The results of 
the Moens and Dumortier compare a Bayesian independence classifier to the Rocchio 
algorithm and a ^ algorithm. The algorithm is used to test how closely a set of 
observed frequencies corresponds to a set of expected frequencies. The observed 
frequencies are the number of texts relevant or non-relevant for the text category that 
contain the feature word. The authors conclude that the X algorithm worked best with a 
recall of 0.73 and precision of 0.64 versus the Bayesian method recall of 0.58 and 
precision of 0.61 and the Rocchio algorithm recall of 0.64 and precision of 0.57. 

Salton and Buckley propose a similar method to show how similar one document 
is to a query document by statistically weighting terms within the document (Salton, 
Buckley, 1988). The authors compare results of eight different term weighting methods 
on different collections of documents. They make recommendations on query and 
document vectors concerning the term-frequency component, the collection frequency 
and the normalization component. The authors conclude that for short queries each term 
is important and query-term weights are preferred. When dealing with document vectors 
the authors conclude that for technical vocabulary, an enhanced frequency-weighting 
scheme should be used which places terms automatically between 0.5 and 1.0. Our 
application involves short technical documents and can use this approach by using 
individual term weighting rather than using a similarity calculation. 
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m. DESCRIPTION OF APPLICATION 


In this sectbn we describe our application, and present an algorithm for 
categorizing email documents using a probabilistic model. The algorithm uses count 
data, the frequency of the terms in a document. Our approach relies on keyword clues. A 
training process identifies categories for new documents from pre-categorized examples. 
The categorization technique used in the algorithm is a linear numeric prediction model 
(Witten, Frank, 2000). 

A. DATA 

A total of 737 emails were used to train and test the categorizer. Table 3 provides 
the data characteristics of the emails. Some emails were previously saved with an 
“.html” extension and others were saved with a “.txt” extension. HTML tags were 
identified and included in the stop-word list. The collection of emails was from the 
author’s personal work archives. Approximately 20% pertain to the authors experience 
as a Supply Officer; the remaining 80% were collected during the author’s experience as 
a graduate student. 



Number of emails 

737 

Total number of words in text corpus 

31,593 

Total number of words after destemming and stopword 
removal 

20,115 

Total number of unique words after destemming, 
stopword removal and extraction of HTML tags and 
other special characters 

~ 12,000 

Number of words per email document 

45 -1082 


Table 2. Training and Testing Data Characteristics. 


Table 3 shows the specific category descriptions and example relevant words 

which the author thinks the program should choose as keyword clues relating to a 

category. The author identified these categories by placing them in logical categories 
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according to their content. Except for categories #4, #5, #10, #15, and #17, the categories 
could be applied to other military-service emails as well as civilian-business emails. 


Category # 1: Classes 

Emails with administrative course material. 

Possible Key Words 

class, info, course info, homework, homework problems, answers 



Category # 2: Grades 

Emails with information about grades and transcripts. 

Possible Key Words 

grades, registrar, python, final grade, homework grade, test grade 



Category # 3: Personal 

Emails with information received from the author’s family members 
and other matters he deemed not directly connected to his work 
environment. 

Possible Key Words 

trip, Michelle, Abby, love, us me, thanks, dear 



Category # 4: 3270 

Emails dealing with connectivity problems to a mainframe computer 
using 3270 emulation software. 

Possible Key Words 

3270, ACID, password, connectivity, lack, mainframe, service 
(ACID - access control identificatbn) 



Category # 5: Bwd Mess 

Emails that involved the authors job as the Marine Officer in charge of 
collecting wardroom dues onboard the USS Belleauwood from 11/98- 
6/99. 

Possible Key Words 

money, payment, dues, mess, receipt, check 



Category # 6: Equipment 

Emails that dealt with equipment issues. 

Possible Key Words 

swap, equipment, truck, weapons, parts, lead-time, fix, gear 



Category # 7: Equipment 
Allowance 

Emails that deal with equipment allowances and what a unit was 
reporting to have. 

Possible Key Words 

OH, on-hand, own, equipment, T/0, temp, shortage, overage 

(OH and T/0 - stand for on hand and table of organization, which is 

the structuring of a unit) 



Category # 8: Equipment 
Readiness 

Emails that pertain to the physical condition of a piece of equipment 
and whether it was working properly or not. 

Possible Key Words 

maintenance, parts, deadline, fix, repair, leadtime, running, MIMMS 
(Marine Corps Integrated Maintenance Mgt. System) 
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Category # 9: Exercises 

Emails that pertain to military exercises the author partook in and 
various problems that he resolved or worked on. 

Possible Key Words 

Cobra-Gold, billeting, exercise, funding, travel, planning, meeting, 
Y2K 



Category #10: Expeditor 

Emails that pertain to a person whose job involved the explicit 
tracking of equipment through the transportation pipeline. 

Possible Key Words 

tracking, equipment, package, arrival, Carl, time, where 



Category #11: Eiscal 

Emails concerning payment and disbursal matters, contracts, and 
equipment receipts. 

Possible Key Words 

payment, due, money, SABRS, JON, financial, authority 

(SABRS - standard accounting and budgeting requirement system; 

JON - job order number) 



Category #12: General 

Emails concerning general administrative purposes. 

Possible Key Words 

administrative, requirement, meeting, turn-in, due 



Category #13: 

Maintenance 

Emails that pertain to the physical condition of a piece of equipment, 
whether it was working properly or not, and the parts status for a piece 
of equipment. 

Possible Key Words 

maintenance, parts, deadline, fix, repair, lead-time, running, MIMMS, 
status, up 

(MIMMS - Marine Corps Integrated Maintenance Mgt. System) 



Category # 14: 
Miscellaneous 

Emails that were of mixed purposes. They are primarily differentiated 
from category #12 by the variety within each email and that category 
#12 had a general administrative theme. 

Possible Key Words 

odd, here, fun, get, read, keep, future, misc 



Category #15: NBC 

Emails that involved Nuclear, Biological or Chemical (NBC) supply 
issues. 

Possible Key Words 

mask, parts, atropine, injector, NBC, filter, gas, chamber 
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Category #16: records 

Emails concerning supply record administration issues. 

Possible Key Words 

CMR, on-hand, drop, add, quantity, description, account, inventory, 
count 

(CMR - consolidated memorandum receipt) 



Category #17: requests 

Emails concerning supply requests for equipment or the purchasing of 
administration supplies. 

Possible Key Words 

computer, request, get, date, buy, purchase, money, contract 



Category # 18: shipboard 
billeting 

Emails concerning issues involving billeting or berthing for Marine 
Officers on board the USS Belleauwood. Most emails revolve around 
room assignments. 

Possible Key Words 

room, assignment, billeting, berthing, Belleauwood, officer, 
assignment 



Category #19: tech info 

Emails that include technical issues focused around the area of 
computer science. This area was differentiated from category # 1 by 
its lack of specificity in many cases. 

Possible Key Words 

networking, computer, artificial, intelligence, software, web, Internet 


Table 3. Category Descriptions and Possible Keywords Identifying These Categories. 


The exact number of texts used for each category in the training and test sets are 
given in Table 4 below. The training and test sets were formed by placing 80% of each 
category into the training set, and the remaining 20% into the test set. A constraint was 
the limited number of examples in some of the categories. As with much text 
categorization, a central problem is the lack of standard data collections. 
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Category 

Number of email texts for 
the training set 

Number of email texts for 
the test set 

Shipboard billeting 

13 

4 

Requests 

20 

5 

Records 

2 

0 

NBC 

7 

2 

Miscellaneous 

17 

4 

Maintenance 

4 

2 

General 

4 

1 

Fiscal 

20 

4 

Expeditor 

5 

2 

Exercises 

6 

2 

Equipment Readiness 

11 

3 

Equipment Allowances 

8 

3 

Equipment 

11 

2 

BWD Mess 

3 

0 

3270 

9 

3 

Personal 

242 

61 

Grades 

22 

6 

Classes 

104 

26 

Tech Info 

88 

22 

Total 

(Total Emails: 737) 

585 

152 


Table 4. Number of Texts Per Category in Training and Test Sets. 

B. EVALUATION MEASURES 

The author uses conventional measures of recall and precision to measure 
categorization accuracy. They are computed by selecting the highest value for the 
returned email: 


emails correctly assigned to a category 
recall = emails in a category 

precision = emails correctly assigned to a category 
emails assigned to a category 


Table 5. Evaluation Measures. 
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The author also displays the results in the form of a two 19x19 confusion matrices 
that show the true categories plotted against actual categories chosen after being run 
through our categorizer. 

C. METHODOLOGY 

Given an email document, an independent binary classifier compares values and 
chooses the single category of highest value. Two methods of preprocessing text are 
used. First a word “destemmer” algorithm is used (Rowe, 1998). The algorithm removes 
suffixes on an English word to regularize its forms. A sample behavior of Porter’s 
stemming algorithm can be seen in Table 6 below: 


Word 

Porter Stemming Algorithm 

believes 

believ 

working 

work 

starting 

start 

playfully 

play 


Table 6. Porter Stemming Algorithm Behavior. 

In addition to destemming words, a “stop word” removal hst is used (Rowe, 
1998). The stop-word list consists of 700 common words such as “a”, “and”, “the”, “of 
”, etc, that are generally non-informative and can be removed to improve categorization. 
Some html tags and special characters were also added to the stop-word list to eliminate 
redundant non-useful characters such as BR, TR, and other commonly used markup tags. 
Two of the most common words, “of ” and “the”, account for 10% of word occurrences 
in most documents (Mooney, 2002). Sample stop word frequencies can be seen in Table 
7. 
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Frequent word 

Number of occurrences 

Percentage of Total 

the 

7,398,934 

5.9 

of 

3,893,790 

3.1 

to 

3,364,653 

2.7 

and 

3,320,687 

2.6 

in 

2,311,785 

1.8 

is 

1,559,147 

1.2 

for 

1,313,516 

1.0 

The 

1,144,860 

0.9 

that 

1,066,503 

0.8 

said 

1,027,713 

0.8 


Table 7. Sample Word Frequency Data: Frequencies from 336,310 Documents in the 1GB 
TREC Volume 3 Corpus 125,720,891 Total Word Occurrences; 508,209 Unique Words. 


13 



THIS PAGE INTENTIONALLY LEET BLANK 


14 



IV. DESCRIPTION OF TEXT CATEGORIZER PROGRAM 


A. PROGRAM STRUCTURE 

The text categorizer was written in Java. The overall architecture for the program 
can be seen in Figure 1. The program starts by accepting email documents in text 
document format. The training set is manually categorized and then both sets are 
tokenized. The email document is then run through a destemming program and removal 
of any of the 700 stop words is done. All capitalized letters are made into lower case. 
The training set is used in the calculation of clue probabilities. If a word remains after 
stop-word removal it must appear a minimum of 10 times to have its probability 
calculated. Additionally, probabilities are calculated by viewing two subdirectories 
labeled, “yes” and “no” and finding the conditional probability of a “yes” given the 
occurrence of particular. The test set involves the calculation of 19 weighted sums for 
each document. 

The “ClueWords” program was adapted with minor changes from another 
program, “GetClueProbs” (Rowe, 1998). The program was modified to extract stop 
words. The RateDocs program was modified to check for the “Subject” line of an email 
and increment the overall word count. The “RateDocs” program takes the weighted sum 
of the number of occurrences of each clueword, and then chooses the category of highest 
total weight. 

B. PROGRAM PERFORMANCE 

All code was written in Java using JDK1.3.1_02 Java Virtual Machine release. 
The programs were executed on a Pentium II Processor Intel MMX chip running 
Windows 98 with 160 MB of RAM. Table 8 shows processing time. 
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Actions Performed in dotted lined box 
are performed by ClueProbs program. 


Figure 1. 
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Train/Test 

Number of documents 

Real total time 
in seconds 

Real time 
Per 

Document 
in seconds 

Training 

585 

225 

4.2 

Test 

152 

165 

0.9 


Table 8. Processing Time. 
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V. EXPERIMENT RESULTS 


A. INTRODUCTION 

After destemming and eliminating stop-words, the training set consisted of 585 
different email documents containing more than 30,000 words including duplicates. 
Minimum word counts were set at 5, 10 and 15 words per document and run through our 
clue-probability program. The number of clue words after destemming, elimination of 
stop-words, and changing of upper-case letters to lower-case letters ranged from 470 to 
1,866 over the different categories. The top five keywords in conditional probability for 
three sample categories are shown below in Tables 9-11. “Personal” is a very large 
category with lots of keywords and high probabilities. “Fiscal” is a medium-sized 
category with some high-ranking clue words, and some important low ones. “Requests” 
is a small-sized category with low-ranked key words. 


% Prob 

# in category (yes) 

# not in category (no) 

Word 

0.99 

515 5 

hotmail 

0.98 

97 1 

resort 

0.98 

127 2 

sooners90 

0.96 

64 2 

Eove 

0.94 

149 9 

her 


Table 9. Example Cluewords in Category Personal. 


% Prob 

# in category (yes) 

# not in category (no) 

Word 

0.92 

36 

3 

ABC* 

0.51 

23 

22 

fiscal 

0.27 

15 

39 

money 

0.22 

14 

47 

spend 

0.21 

15 

54 

finance 


Table 10. Example Cluewords in Category Eiscal. 


% Prob 

# in category (yes) 

# not in category (no) 

Word 

0.53 

15 

13 

mimm 

0.50 

8 

8 

laptop 

0.27 

5 

13 

gear 

0.25 

4 

12 

machine 

0.12 

34 

230 

request 


Table 11. Example Cluewords in Category Requests. 
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Our classifiers were tested upon 152 new, previously unseen email texts. Table 
12 shows recall and precision for the test set. Table 13 shows a confusion matrix of 19 
categories and the 152 test set documents. This shows which categories were “confused” 
with one another and which categories were clearly identified. 


Category 

Recall 

Precision 

Classes 

0.77 

0.18 

Grades 

1.00 

0.05 

Personal 

0.18 

1.00 

3270 

0.67 

0.07 

BWD Mess 

0.01 

0.00 

Equipment 

0.50 

0.03 

Equipment Allowance 

0.67 

0.05 

Equipment Readiness 

1.00 

0.12 

Exercises 

1.00 

0.06 

Expeditor 

0.50 

0.06 

Eiscal 

0.80 

0.10 

General 

1.00 

0.05 

Maintenance 

1.00 

0.07 

Miscellaneous 

0.80 

0.18 

NBC 

0.50 

0.17 

Records 

Not enough 

Not enough 


records 

records 

Requests 

0.80 

0.11 

Shipboard Billeting 

0.80 

0.08 

Info_Tech 

0.32 

0.78 

Average 

0.65 

0.17 


Table 12. Results of Categorizing Test Set Documents. With Subject Line Code Included. 


20 





Category Name 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

1 

Classes 

113 

8 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

31 

2 

Grades 

30 

111 

0 

9 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

2 

0 

0 

0 

0 

3 

Personal 

4 

0 

110 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

38 

0 

0 

0 

0 

4 

3270 

0 

0 

8 

29 

0 

16 

0 

0 

0 

0 

0 

0 

10 

40 

0 

0 

30 

0 

19 

5 

BWD Mess 

0 

0 

60 

0 

37 

n 





0 

0 

0 

55 

0 

0 

0 

0 

0 






6 

Equipment 

4 

0 

29 

0 

0 

21 

15 

15 

0 

0 

0 

0 

18 

20 

29 

0 

0 

0 

0 

7 

EqmntAll 

0 

0 

0 

0 

0 

20 

40 

21 

11 

0 

0 

0 

11 

30 

0 

0 

0 

0 

20 

8 

EqmntRead 

0 

2 

0 

0 

0 

30 

29 

42 

20 

0 

7 

14 

0 

0 

8 

0 

0 

0 

0 

9 

Exercises 

0 

0 

0 

0 

0 

20 

25 

15 

38 

10 

0 

0 

0 

30 

0 

14 

0 

0 

0 

10 

Expeditor 

0 

0 

0 

0 

6 

15 

LZ 

2Q. 

30 

18 

0 

1 

0 

45 

0 

0 

0 

0 

0 

11 

Fiscal 

0 

0 

0 

0 

0 

16 

0 

0 

0 

0 

71_ 




0 

0 

0 

0 

0 
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General 

0 

0 

0 

50 

0 

7 

0 

0 

0 

0 

( 


19 

0 

63 

0 

7 

2 

0 

4 

13 

Maintenance 

0 

0 

0 

0 

0 

42 

0 

0 

12 

0 

( 


0 

28 

52 

0 

0 

18 

0 

0 

14 

Miscellaneous 

1 

0 

12 

0 

0 

0 

0 

0 

0 

0 

( 


12 

0 

120 

0 

0 

0 

0 

7 


NBC 





















15 

0 

0 

60 

0 

0 

6 

3 

4 

0 

0 

c 


0 

0 

70 

9 

0 

0 

0 

0 

16 

Records * 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

17 

Requests 

0 

0 

40 

0 

0 

20 

0 

15 

0 

0 

13 

17 

0 

0 

0 

0 

47 

0 

0 

18 

Sbrd Billeting 

8 

0 

15 

0 

6 

0 

0 

0 

39 

0 

0 

12 

0 

0 

0 

9 

0 

63 

0 

19 

InfoTech 

70 

0 

60 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

12 

0 

0 

0 

0 

10 


Table 13. 19x19 Confusion Matrix and Potential Category Clusters. 

* Not enough data to accurately categorize 


B. DATA ANALYSIS 

Results indicate that an average of 65% of all documents were correctly classified 
into their respective category. Of the 19 categories, 15 had greater than 50% probability 
of being properly classified. The remaining 4 categories were not properly classified for 
several reasons. In the case of categories #4 (Personal), and #19 (Info_Tech), these were 
frequently confused with one another and with category #14 (Miscellaneous). For 
categories #5 (BWD Mess) and #16 (Records) there were not enough examples to train 
on. Category #19 (Info Tech) was often confused with category #I (Classes). Larger 
categories had better precision; smaller categories demonstrated higher recall. Average 
recall rates were acceptable, but precision rates were disappointing and can be 
contributed to categories clustered together. 

Due to the unique nature of the corpus, each of the categories seemed to have 
certain cluewords that only helped it. In some categories stop words could have been 
good discriminators, such as category #3 (Personal) where the words “can” and “do” 
frequently show up. 
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Results with double weighting of the document “Subject” line show minimal 
increase in overall success probability. Table 14 shows actual number of words in 
categories as compared against the number of non-example documents. The table 
identifies the low number of training examples for smaller categories such as #5 (BWD 


Mess). 


Category Name 

Actual # words in 

Actual # of words in category 

category (yes - examples) 

(no - examples) 

Classes 

78,939 

562,272 

Grades 

25,952 

617,491 

Personal 

417,982 

239,196 

3270 

1,634 

638,416 

BWD Mess 

237 

639,813 

Equipment 

2126 

637,924 

Equipment 

Allowance 

1945 

638,105 

Equipment 

Readiness 

1901 

638,149 

Exercises 

481 

639,569 

Expeditor 

486 

639,564 

Eiscal 

3757 

636,293 

General 

538 

639,512 

Maintenance 

881 

639,169 

Miscellaneous 

3332 

636,718 

NBC 

685 

639,365 

Records 

65 

639,985 

Requests 

3233 

640,050 

Shipboard 

Billeting 

1500 

641,783 

Info_Tech 

115,961 

541,280 


Table 14. Actual Number of Words Identified for Positive Examples Versus Non-Examples. 
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VI. CONCLUSIONS AND RECOMMENDATIONS 


A. CONCLUSIONS 

This thesis examined automatic text categorization of email documents. The use 
of keywords and their conditional probabilities was the primary method used. Final 
recall and precision results were 65% and 17% respectively. A stop-word list and 
destemmer program proved to be very helpful when dealing with text categorization. 

B. RECOMMENDATIONS 

This thesis could be extended by incorporating term phrases to improve 
categorization. If possible, a more standardized corpus of text should be used with 
approximate equal number of documents per category. A program to accurately strip out 
all HTML characters and other special characters for non-text would be helpful. Finally, 
a program or method to deal with personal names might prove very helpful in some 
categories. 
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APPENDIX A. CLUEWORDS SAMPLE OUTPUT 


0.03878220540933329 overall probability, 18146 yes examples, 449749 no 
examples, 

0.0 0 12 shape 
0.0 0 17 shall 
0.024 3 122 write 
0.0 0 72 friend 
0.0 0 37 certificate 
0.0 0 14 comfort 
0.0 0 17 netscape 
0.0 0 43 bwlogu 

0.00392156862745098 1 254 usmc 

0.0 0 14 considerate 

0.05834683954619125 36 581 monterey 

0.0 0 13 justin 

0.06666666666666667 6 84 here 

0.0 0 40 lejeun 

0.0 0 34 hell 

0.0 0 83 effect 

0.0 0 15 comment 

0.0 0 51 sans-serif 

0.0125 1 79 hear 

0.05782060785767235 78 1271 head 
0.0 0 17 friday 
0.0 0 17 urge 

0.06481481481481481 7 101 strategy 

0.06484641638225255 57 822 subject 

0.17073170731707318 7 34 interact 

0.0 0 188 Mike 

0.0 0 18 your-account 

0.0 0 11 extreme 

0.0 0 29 prodigy 

0.0 0 63 Fred 

0.06097560975609756 40 616 http-equiv 
0.0 0 38 script 

0.17391304347826086 4 19 before 

0.0 0 13 high-spee 

0.1111111111111111 2 16 amador 

0.0 0 62 Logue 

0.0 0 12 accommodate 

0.0 0 26 simply 

0.0 0 26 upon 

0.014492753623188406 1 68 federal 
0.0 0 45 false 

0.13333333333333333 2 13 adrian 
0.0 0 17 hidden 


** The list continues for up 38 pages. 
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APPENDIX B. CLUEWORDS PROGRAM 


* Title: ClueWords 

* 

* Description: This class performs two functions. One is to take out 

* each non-stop word and then looks at two subdirectories of "yes" 

* and "no", find the conditional probabilities of "yes" 

* given the occurrence of a particular word. Initial use is to 

* test against emails and try to categorize them appropriately. 

* 

* Elements adapted from Dr. Neil Rowe's programs GetClueProbs 

* and Count Words. 

* 

* Copyright: Copyright (c) 2002 

* Company: USMC NPS 

* ©author Scott R. Hall 

* ©version 1.0 
*/ 

import java.io.*; 
import java.util.*; 

public class ClueWords 

{ 

public static void main (String args[]) throws lOException 

{ 


// - 

// Data Member Declarations 

// - 

* call Parser method. Used only for testing. 

*/ 

// Parser (); 

// Mincount sets the mininum number of times that a word 

// must appear in order to have its' probability calculated. 

double Mincount = 10;//was 10 

// An integer declaration 

intj; 

// Long integer data types for 4 items that allow better 

// granularity for calculating probabilities. 

long Oldcount, Count, wordyescount, wordnocount; 

// Integer declarations and assignment values. 
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int yescount = 0; 
int nocount = 0; 

// Double real number declarations for 5 items that allow 

// Standard Deviation and Probabilit ies to be displayed properly. 

double yesratio, yesprob, Dev, Prob, SD; 

// String characters declared to include a string tokenizer 

// to help extract tokens from emails. 

String Inputline, Word, Stopword; 

StringTokenizer st; 

// File 

File Dir; 

// Declaring and creating HashSet. Implements Set using an 

// internal hashtable. Allows any type of object or null to 

// be a member of the set. There is no guarantee of order 

// for the set elements. There are no duplicates in a HashSet. 

HashSet rchs = new HashSet(); 

// Within the Destemmer class call the hashKnownWords method 

// and pass it results of the rchs. 

Destemmer.hashKnownWords(rchs); 

// Declaration and creation of hashmap. Same thing as a Hashtable 

// but methods are not synchronized. 

HashMap hm = new HashMap(200000); 

// Declarations below imported from CountWords program. 

HashSet hsstop = new HashSet(lOOO); 

FileReader frl = new FileReader("stopwords.txt"); 

BufferedReader brl = new BufferedReader(frl); 

while ((Stopword = brl.readLine()) != null) hsstop.add(Stopword); 

// Creating a new instance of the Directory object and passing 

// it the contents of ''yes'' directory. 

Dir = new File("yes"); 

String Filelist [] = Dir.list(); 

// A "for" loop to go through "yes/" directory and read in 

// flies via buffered reader. 

for (j=0; j<Filelist.length; j++) 

{ 

FileReader fr = new FileReader("yes/" + Filelist(j]); 
BufferedReader br = new BufferedReader(fr); 
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// Inner "while" loop while the buffered reader is not empty(null) 

// create a new String Tokenizer Object and tokenize based on the 

// characters identified. 

while ((Inputline = br.readLine()) != null) 

{ 

St = new StringTokenizer(Inputline," ,.;:'-*?!()[]{ }_+=l\\\"<>/@#&*"); 

// Another inner "while" loop that loops through each token while 

// there are more tokens left to tokenize. If the is not a number 

// string than increment "yescount" and destem the word. 

// 

while (st.hasMoreTokensO) 

{ 

Word = st.nextTokenO; 

if ((Word.length()>l) && (!numberString(Word)) && 

(! hsstop .contains (W ord))) 

{ 

yescount++; 

Word = Destemmer.destem(Word,rchs); 

if (!hm.containsKey(Word)) 

{ 

hm.put(Word,new Long( 1000000)); 


else 

{ 

Oldcount = (Long)hm.get(Word)).longValue(); 
hm.put(Word,new Long( 1 OOOOOO+Oldcount)); 
)//end of last if statement 
)//end of "yescount" if statement 
)//end of second while statement 
)//end of first while statement 
fr.close();//close out of file reader 
)//end of for statement 


// Same statements except for no category. 


Dir = new File("no"); 

String Filelist2 [] = Dir.list(); 
for (j=0; j<Filelist2.length; j++) 

{ 

FileReader fr2 = new FileReader("no/" + Filelist2(j]); 

BufferedReader br2 = new BufferedReader(fr2); 
while ((Inputline =br2.readLine()) != null) 

{ 

St = new StringTokenizer(Inputline," ,.;:'-*?!()[]{ }_+=l\\\"<>/@#&*"); 
while (st.hasMoreTokensO) 

{ 

Word = St.nextTokenO; 

if ((Word.length()>l) && (!numberString(Word)) && 

(! hsstop .contains (Word))) 
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nocount++; 

Word = Destemmer.destem(Word,rchs); 
if (!hm.containsKey(Word)) 
{hm.put(Word,new Long(l)); 

I 


else 

{ 

Oldcount = ((Long)hm.get(Word)).longValue(); 
hm.put(Word,new Long(l+01dcount)); 


fr2.close(); 

I 


PrintWriter fileout = new PrintWriter(new FileWriter("clueprobs.out'')); 

if (nocount > 0) yesratio = (double)yescount/(double)nocount; 

else yesratio = 2.0*(double)yescount; 

yesprob = (double)yescount/(double)(yescount+nocount); 

fileout.println(yesprob + " overall probability," + yescount + " yes examples, '' + nocount 
+ " no examples,"); 

Set set = hm.entrySetO; 

Iterator i = set.iterator(); 

while (i.hasNextO) 

{ 

Map.Entry me = (Map.Entry)i.next(); 

Word = (String)me.getKey(); 

Count = ((Long)me.getValue()).longValue(); 
wordnocount = Count % 1000000; 
wordyescount = (Count-wordnocount)/1000000; 

Dev = (double)wordyescount-(yesratio*(double)wordnocount); 

Prob = (double)wordyescount/(double)(wordyescount+wordnocount); 

SD = Math.sqrt( 1.0/(( 1.0/(double)wordyescount)+( 1.0/(double)wordnocount))); 
if (((wordyescount+wordnocount)>Mincount) & (Math.abs(Dev) > SD)) 
fileout.println(Prob + " + wordyescount + " + wordnocount 
+ " " + Word); 


} 

fileout.close(); 


/* Says whether a string of characters represents an integer or decimal */ 
private static boolean numberString (String S) 

{ 

boolean numberflag = false; 
int N = S.lengthO; 
if (N > 0) 

{ 


int i=0; 
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if (S.charAt(O) == i=l; 
charC; 

numberflag = true; 

while ((numberflag) & (i<N)) 

{ 

C = S.charAt(i); 

numberflag = (((C >= ’O') & (C <= '9')) I (C == 
i++; 


return numberflag; 


//For now I am remarking this call out 
/* 

public static void Parser () 

{ 

Parser parser = new Parser (''yessupplyfiscal.txt''); 

I 

*/ 

} 
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APPENDIX C. RATEDOCS PROGRAM 


// Given a directory "unknown" of files of unknown relevance, rates 
// each document for the appearance of clues in the clueprobs.out file. 

// Author: Neil C. Rowe, 9/01. Modified with permission by Scott R. Hall 
import java.io.*; 
import java.util.*; 


class RateDocs 

{ 

public static void main (String args[]) throws lOException 

{ 

int j, kl, k2, Wordcount, M; 

double yesratio, Dev, Prob, SD, Average, Total; 

Double DProb; 

String Inputline, Word, Probstring; 

StringTokenizer st; 

File Dir; 

HashSet rchs = new HashSet(); 

Destemmer.hashKnownWords(rchs); 

HashMap hm = new HashMap(200000); 

FileReader fr; 

BufferedReader br; 

String tempString = new String ();//temporary hold string object for subject line 

String subjectLine = new String (); 

boolean foundSubj = false;//flag set to find subject line 

FileReader frprobs = new FileReader(''clueprobs.out 

BufferedReader brprobs = new BufferedReader(frprobs); 

Inputline = brprobs.readLine(); 
k2 = Inputline.lastIndexOfC '); 
kl = Inputline.lastIndexOfC ',k2-l); 

double Totalprob = Double.valueOf(Inputline.substring(kl+l,k2)).doubleValue(); 
while ((Inputline = brprobs.readLine()) != null) 

{ 

kl = Inputline.indexOfC '); 
k2 = Inputline.lastIndexOfC '); 

M = Inputline.lengthO; 

Probstring = Inputline.substring(0,kl); 

Prob = (Double.valueOf(Probstring).doubleValue()) - Totalprob; 

Word = Inputline.substring(k2+l,Inputline.lengthO); 
hm.put(Word, new Double(Prob)); 

I 

frprobs.close)); 


Dir = new File(''unknown'');// Begin unknown directory here 
String Filelist [] = Dir.list(); 
for (j=0; j<Filelist.length; j++) 


Wordcount = 0; 

Total = 0.0; 

fr = new FileReader)"unknown/" + Filelist(j]); 

br = new BufferedReader(fr); 

while ((Inputline = br.readLine))) != null) 

{ 
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St = new StringTokenizer(Inputline," ,.;:'-*?!()[]{ }_+=l\\\"<>/@#&*"); 
boolean foundSubj = false; 

//use "Subj:" forhtml 

if (IfoundSubj && Inputline.indexOf("Subject:") >= 0 ) 

{ 

subjectLine = Inputline; 

foundSubj = true;//change flag to true 

Wordcount++; 


PrintWriter fileout = new PrintWriter(new FileWriter("RATEDOCS.out'')); 

while (st.hasMoreTokensO) 

{ 

Word = st.nextTokenO; 

if ((Word.length()>l) && (!numberString(Word))) 

{ 


Word = Destemmer.destem(Word,rchs); 

Wordcount++; 
if (hm.containsKey(Word)) 

{ 

DProb = (Double)hm.get(Word); 

System.out.println(DProb + '' retrieved for '' + Word); 

Total = Total+(DProb.doubleValue()); 

fileout.println(DProb + '' retrieved for " + Word); 

} 


fileout.close)); 

)//outer if statement for subject line find 
) //while close 

fr.closeO; 

Average = Total/(double)Wordcount; 

System.out.println)Average + '' strength for document'' + Filelist[j]); 


/* Says whether a string of characters represents an integer or decimal */ 
private static boolean numberString (String S) 

{ 

boolean numberflag = false; 
int N = S.length)); 
if (N > 0) 

{ 

int i=0; 


if (S.charAt(O) == '-') i=l; 
charC; 

numberflag = true; 
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while ((numberflag) & (i<N)) 

{ 

C = S.charAt(i); 

numberflag = (((C >= ’O') & (C <= '9')) I (C == 
i++; 


return numberflag; 
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APPENDIX D. SAMPLE RUN EROM RATE DOCS 


Test Set - Rate Docs for grades run against clueprobs 

0.9326399520216222 retrieved for qpr 
0.9326399520216222 retrieved for qpr 
0.9326399520216222 retrieved for qpr 
0.9326399520216222 retrieved for qpr 
0.9326399520216222 retrieved for qpr 
0.9326399520216222 retrieved for qpr 
0.848555867937538 retrieved for nw3230 
0.6519746713563415 retrieved for logistic 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 
0.6172012256239916 retrieved for grade 

* Document continues for up to 38 pages. 
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APPENDIX E. PARSER PROGRAM 


* Title: Parser 

* Description: (1) Reads a text file (emails saved as *.txt file) 

* (2) Finds Subject line and parses it to find keywords 

* (3) Reads entire file and counts the frequncy of occurance of key words in 

* (4) Prints subject line keywords & freqs to screen 

* 

* Some elements adapted from Steve Simmon’s Parser program 

* ©version 1.0 
*/ 

import java.io.*; 
import java.util.*; 

public class Parser { 

//Class Variables (Global) 

String fileName = new String)); 

StringBuffer filetext = new StringBuffer(); 

String subjectLine = new String)); 

String keyClueWord = new String)); 

Vector subjKeywords = new Vector)); 

// Constructor 

public Parser(String fileinput) { 

//get the filename from the commandline argument 
fileName = fileinput; 

ReadFile)); 

ParseSubject)); 

ParseEmailText)); 

//ReadClueProbs)); 

) //end Constructor 


// Method: ReadFile 

void ReadFile)) { 

String tempString = new String)); 
boolean foundSubj = false; 
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try{ 


BufferedReader fileReader = new BufferedReader(new FileReader(fileName)); 
while(fileReader.ready()) 

{ 


//Read each line of the email text file & store in string buffer 
tempString = fileReader.readLine(); 

//make all lowercase 

tempString = tempString.toLowerCase(); 


//Find subject line, change it from just "subj” 
If(!foundSubj && tempString.indexOf(''subj'') >= 0 ) 
{ 

subjectLine = tempString; 
foundSubj = true; 

//Debug print out 

System.out.println("Subject line: '' + subjectLine ); 


//add line read to String Buffer, goes to frequency count 
filetext.append(tempString); 

) //end While 

) //end try stmt 

//Opening a file via FileReader object can throw FileNotFound Exception 
catch(FileNotFoundException fileEX) { 

1 


//Reading text in from a file can throw an lOException 
catch(IOException IOEX){ 

} 


) //end method ReadEile 


// Method: ParseSubject 

void ParseSubjectO 

{ 


//create a String Tokenizer from the string that is the subject line 
//default tokinizing is to break string into words 
StringTokenizer subjectWords = new StringTokenizer(subjectLine); 
String tempString = new StringO; 
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while (subj ectWords. hasMoreTokens ()) 

{ 


tempString = subjectWords.nextToken(); 
//Debug print out 

System.out.println("Token: " + tempString ); 

//check to see if word id a key word; 

//if the word is a keyword, add to vector 
if(KeyWord(tempString)) 

{ 


KeyWord temp = new KeyWord(tempString); 
subjKeyWords.add(temp); 


) //end While 

) //end method ParseSubject 


// Method: KeyWord 


boolean KeyWord(String text){ 
boolean IsKeyWord = true; 

String smallWords[] = {''and'', "the", "a", 
"subject", "subj", "re", 

String tempString = new StringO; 


"an", "if", "it", "is", "this", 

" ", "FW:", 

"fwd:"}; 


for(int i = 0; i < smallWords.length; i++) 

{ 

tempString = smallWords[i]; 


if(text. starts With(smallW ords [i])) 

{ 

IsKeyWord = false; 
break; 

} 


) //end for loop 
return IsKeyWord; 

) //end method KeyWord 

// Method: ParseEmailText 

void ParseEmailTextO 

{ 
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Iterator KeyWordITR = subjKeyWords.iterator(); 

KeyWord tempKeyWord = new KeyWord("txt'');//dummy variable 

int counter = 1; 

int keyWordFreq = 0; 

while(KeyWordITR.hasNext()) 

{ 

//Get keyword from vector keywords in subject 

tempKeyWord = (KeyWord) KeyWordITR.next(); 

//reset Freq 

keyWordFreq = 0; 

//Get Freq for this word 

keyWordFreq = getFrequency(tempKeyWord.keyword, filetext.toStringO); 
tempKeyWord. frequency = keyWordFreq; 
if (KeyWordFreq > 2) 

{ 


//print out result to screen 

System.out.println("Subject Keyword " + counter++ + " + tempKeyWord.key word + 

'' Frequency: '' + tempKeyWord.frequency); 

) //end while stmt 
) //end Method ParseEmailText 

// Method: getFrequency 

int getFrequency(String keyWord, String file) 

{ 

int count = 0; 
int index = -1; 

index = file.indexOf(keyWord); 

//1st occurance of keyword found 
if(index >= 0) 

{ 


//increment count and make recursive call to this function with remaining 
//text less all words up to and including the found keyword occurance 
count = 1 + getFrequency(keyWord, file.substring/index + keyWord.length)))); 


return count; 

I 

// Method: main 

public static void main(String[] args) 

{ 

Parser parserl = new Parser(''testparser.txt");//was args[0] or''cs4556_8.txt'' 

I 

} 
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